## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This dataset has 1,599 observations of 13 variables. The first variable, X, is a row index, which leaves 12 wine properties.
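The structure check above can be tried on a small scale. This is a minimal sketch, assuming only the first ten rows of a few columns as printed in the `str()` output; the names `wine_head` and `wine_props` are hypothetical, and treating `X` as a row index is an assumption based on its values (1, 2, 3, ...).

```r
# First ten rows of a few columns, copied from the str() output above.
# `wine_head` is a hypothetical name; the full data has 1,599 rows.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)  # column names
str(wine_head)    # type and leading values of each column
dim(wine_head)    # number of rows, then number of columns

# Assumption: X only numbers the rows, so it is set aside before analysis.
wine_props <- wine_head[, names(wine_head) != "X"]
names(wine_props)
```

Dropping the index column first keeps it out of later summaries such as a correlation matrix, where it would otherwise appear as if it were a wine property.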
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
This plot shows that most wines in this dataset score 5 or 6 (681 and 638 of the 1,599 wines, respectively). Quality scores in this dataset range from 3 to 8.
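The summary above can be recovered from the frequency table alone. The sketch below rebuilds the quality column from the printed counts (the order of the wines is lost, which does not matter for these statistics); the variable names are illustrative.

```r
# Rebuild the quality scores from the frequency table above:
# each score is repeated as many times as it was counted.
scores  <- 3:8
counts  <- c(10, 53, 681, 638, 199, 18)
quality <- rep(scores, times = counts)

length(quality)   # total number of wines
summary(quality)  # minimum, quartiles, median, mean, maximum
table(quality)    # number of wines per score

# Proportion of wines scored 5 or 6
share_5_or_6 <- mean(quality %in% c(5, 6))
share_5_or_6
```

Because `summary()` and `table()` depend only on how often each score occurs, the rebuilt vector gives the same mean and quartiles as the original column.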
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
This plot shows that many of the wines sampled have a fixed acidity between 7 and 9 g/dm^3. When faceted by quality, most of the data falls under quality scores of 5 and 6, with some under 7. The faceted plots share roughly the same bell shape as the plot for the entire sample, so it is hard to tell from these plots whether fixed acidity is related to quality.
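The `stat_bin()` messages above appear whenever a histogram is drawn without an explicit bin setting. The following is a sketch, assuming the ggplot2 package is installed, of setting `binwidth` directly and faceting by quality. It uses only the first ten rows printed in the `str()` output, and the bin width of 0.5 is an illustrative choice, not the value used for the plots in this report.

```r
library(ggplot2)

# First ten rows of two columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Setting binwidth explicitly avoids the "using bins = 30" message.
# 0.5 g/dm^3 is a hypothetical value chosen for illustration.
p_all <- ggplot(wine_head, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Fixed acidity (g/dm^3)", y = "Number of wines")

# The same histogram with one panel per quality score.
p_by_quality <- p_all + facet_wrap(~ quality)

print(p_all)
print(p_by_quality)
```

The bin width is expressed in the units of the x variable, so a sensible value depends on the range of each property.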
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Similarly, this plot shows that many of the wines sampled have a volatile acidity between 0.4 and 0.8 g/dm^3. When faceted by quality, most of the data falls under quality scores of 5 and 6, with some under 7. The faceted plots share roughly the same shape as the plot for the entire sample, so it is hard to tell from these plots whether there is a relationship. It is worth noting that removing the outliers gives a clearer picture.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
This plot shows that most wines have a citric acid level of less than 0.75.
## Warning: Removed 84 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
After trimming the outliers from the first (long-tailed) plot to produce the second, we can see that the majority of wines have between 1.5 and 2.5 g/dm^3 of residual sugar. While 75% of wines have at most 2.6 g/dm^3 of residual sugar, the most extreme case reaches 15.5 g/dm^3. Most red wines aren't sweet! (Wines with more than 45 g/dm^3 are considered sweet.)
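Trimming a long tail can be sketched in base R as follows. The "Removed ... rows" warning above is the kind ggplot2 prints when axis limits exclude data, so the original plot was likely trimmed through its x-axis limits; the exact cutoff is not shown, so the 90th percentile used here is a hypothetical choice. The values are the first ten residual sugar readings from the `str()` output.

```r
# First ten residual sugar values (g/dm^3), from the str() output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical cutoff: keep values at or below the 90th percentile.
cutoff  <- quantile(residual.sugar, probs = 0.90)
trimmed <- residual.sugar[residual.sugar <= cutoff]

summary(residual.sugar)  # full sample, including the long upper tail
summary(trimmed)         # after dropping the upper tail

# Number of values removed by the cutoff
n_removed <- length(residual.sugar) - length(trimmed)
n_removed

# Histogram of the trimmed values
hist(trimmed, breaks = 10,
     main = "Residual sugar (trimmed)", xlab = "Residual sugar (g/dm^3)")
```

Trimming changes only what is displayed or summarized; the full-sample summary should still be reported alongside it, as is done above.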
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
This plot shows that almost all red wines have a density of less than 1 g/cm^3.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 74 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
After trimming the long tail from the original chlorides plot, the distribution looks approximately normal. Most red wines (75%) have at most 0.09 g/dm^3 of chlorides.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This plot looks approximately normal as well. Most wines (75%) have a pH of at most 3.4. The minimum pH is 2.74 and the maximum is 4.01 for the wines in our data set.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
This plot looks right skewed (the mean of 15.87 exceeds the median of 14), with the level of free sulfur dioxide as low as 1 mg/dm^3 and as high as 72 mg/dm^3.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
This plot looks right skewed (the mean of 46.47 exceeds the median of 38), with the level of total sulfur dioxide as low as 6 mg/dm^3 and as high as 289 mg/dm^3.
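One quick way to check the direction of skew is to compare the mean with the median: a long tail pulls the mean toward it. The sketch below applies that rule of thumb to the summary values printed above for the two sulfur dioxide variables. It is a heuristic, not a formal test, and `skew_direction` is a hypothetical helper.

```r
# Summary values copied from the output above (mg/dm^3).
free_so2  <- c(median = 14.00, mean = 15.87, max = 72)
total_so2 <- c(median = 38.00, mean = 46.47, max = 289)

# Rule of thumb only: a mean above the median points to a longer
# right (upper) tail, a mean below it to a longer left tail.
skew_direction <- function(s) {
  if (s[["mean"]] > s[["median"]]) {
    "right"
  } else if (s[["mean"]] < s[["median"]]) {
    "left"
  } else {
    "roughly symmetric"
  }
}

skew_direction(free_so2)
skew_direction(total_so2)
```

For both variables the mean exceeds the median and the maximum lies far above the third quartile, which points to a long upper tail.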
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 58 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 58 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The majority of red wines in this data set have less than 1 g/dm^3 sulphates.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are 1,599 red wines in the dataset with 12 properties (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol percentage and quality). The variable quality is an ordinal score (stored here as an integer) on the following scale.
quality: 0 (very bad) -> 10 (excellent)
Other observations:
Most wines have a quality score of 5, followed by 6. The median fixed acidity of red wines in this data set is 7.9 g/dm^3. A small number of wines have very high fixed acidity. Most red wines here have a volatile acidity between 0.2 and 0.8 g/dm^3. All red wines have a citric acid level of at most 1 g/dm^3. 75% of wines have a citric acid level of at most 0.42 g/dm^3. The majority of wines have a residual sugar level between 1.5 and 2.5 g/dm^3.
The main feature of interest in my dataset is quality. I'm evaluating which other variables contribute to or correlate with quality.
I think volatile acidity, residual sugar, citric acid, pH, total sulfur dioxide and alcohol will affect the quality of wines.
No.
No.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
Based on these correlation statistics (Pearson's r), the two variables most strongly correlated with quality are alcohol (r = 0.476) and volatile acidity (r = -0.391). Both correlations are moderate. Alcohol is positively correlated with quality (the higher the alcohol, the higher the quality), and volatile acidity is negatively correlated with quality (the higher the volatile acidity, the lower the quality).
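Pearson correlations like those in the matrix above can be computed with `cor()`, and `cor.test()` adds a p-value and confidence interval. This is a minimal sketch on the first ten rows printed in the `str()` output, so its coefficients will differ from the full-sample values quoted above; `wine_head` is a hypothetical name.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations with quality
r_alcohol  <- cor(wine_head$alcohol, wine_head$quality, method = "pearson")
r_volatile <- cor(wine_head$volatile.acidity, wine_head$quality,
                  method = "pearson")

# Full correlation matrix for all columns
r_matrix <- cor(wine_head, method = "pearson")
r_matrix

# cor.test() reports the estimate with a p-value and confidence interval,
# which helps judge whether a correlation could be due to chance.
test_alcohol <- cor.test(wine_head$alcohol, wine_head$quality)
test_alcohol$estimate
test_alcohol$p.value
```

With only ten rows the confidence interval is wide; on the full 1,599 rows the same calls would give far tighter estimates.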
Plotting volatile.acidity against quality, with the data points colored by quality, shows a trend: higher quality wines tend to have lower volatile acidity. However, a large group of wines with quality scores of 5 and 6 share the same volatile acidity levels, and some wines with a score of 8 have higher volatile acidity than some wines scored 5 or 6.
The boxplots show that the median volatile acidity decreases as the quality of wine increases.
This scatterplot of quality vs. alcohol shows that higher quality wines tend to have higher alcohol levels. There are a few exceptions: some high quality wines have lower alcohol levels than lower quality wines.
The boxplots show that the median alcohol level decreases from a quality score of 4 to 5, but from a score of 5 upward, the higher the quality, the higher the alcohol level.
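The center line of a boxplot marks the group median rather than the mean, so the numbers behind such a plot can be listed with a grouped median. The sketch below does this for alcohol by quality on the first ten rows from the `str()` output; only scores 5, 6 and 7 occur in those rows, and the names are illustrative.

```r
# First ten rows of two columns, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median and mean alcohol for each quality score
median_by_quality <- tapply(wine_head$alcohol, wine_head$quality, median)
mean_by_quality   <- tapply(wine_head$alcohol, wine_head$quality, mean)

median_by_quality
mean_by_quality

# Boxplot of alcohol for each quality score; the thick line in each
# box is the median listed above.
boxplot(alcohol ~ quality, data = wine_head,
        xlab = "Quality score", ylab = "Alcohol (% by volume)")
```

Printing both statistics side by side shows where they diverge, which happens when a group contains a few unusually high or low values.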
## Warning: Removed 126 rows containing non-finite values (stat_smooth).
## Warning: Removed 126 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
This scatterplot of volatile acidity against citric acid shows a moderate negative relationship: the higher the citric acid, the lower the volatile acidity.
In this box plot of quality vs. total sulfur dioxide, we see a pattern similar to that of quality vs. alcohol, with a turning point at a quality score of 5: from 5 upward, the higher the quality of the wine, the lower the total sulfur dioxide.
Higher quality wines tend to have lower volatile acidity. Among wines with a quality score of 5 and above, higher quality wines tend to have lower total sulfur dioxide.
Fixed acidity and citric acid have a Pearson's r of 0.67.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
This is a plot matrix showing every pair of variables.
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## # A tibble: 6 x 5
## quality density median_alcohol median_density n
## <int> <dbl> <dbl> <dbl> <int>
## 1 3 0.99471 9.80 0.99471 1
## 2 3 0.99476 10.90 0.99476 1
## 3 3 0.99600 9.95 0.99600 1
## 4 3 0.99660 10.70 0.99660 1
## 5 3 0.99705 9.70 0.99705 1
## 6 3 0.99808 10.20 0.99808 1
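In the table above every group has n = 1, which suggests the data were grouped by both quality and density, so each "median" is simply a single wine's own value. The sketch below, assuming the dplyr package is installed, contrasts that grouping with grouping by quality alone. The code that produced the table is not shown, so this reconstruction is a guess; the six rows are the quality-3 wines from the table, with `alcohol` taken from `median_alcohol`.

```r
library(dplyr)

# Six quality-3 wines from the table above. Each group there has n = 1,
# so median_alcohol and median_density are the wines' own values.
q3 <- data.frame(
  quality = rep(3L, 6),
  density = c(0.99471, 0.99476, 0.99600, 0.99660, 0.99705, 0.99808),
  alcohol = c(9.80, 10.90, 9.95, 10.70, 9.70, 10.20)
)

# Grouping by quality AND density: one row per distinct pair.
by_pair <- q3 %>%
  group_by(quality, density) %>%
  summarise(median_alcohol = median(alcohol),
            median_density = median(density),
            n = n(),
            .groups = "drop")

# Grouping by quality only: one row per quality score.
by_quality <- q3 %>%
  group_by(quality) %>%
  summarise(median_alcohol = median(alcohol),
            median_density = median(density),
            n = n(),
            .groups = "drop")

by_pair
by_quality
```

Grouping by quality alone gives one summary row per score, which is usually the more useful table when the aim is to compare scores.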
This graph plots alcohol against density, colored by quality.
These graphs show volatile acidity against citric acid by quality. There appears to be a negative correlation between volatile acidity and citric acid, but no clear indication of a correlation between quality and citric acid. We see fewer wines with high volatile acidity as quality goes up, which suggests a correlation there, although the distributions for quality scores of 5 and 6 look quite similar.
There seems to be a correlation between alcohol level and density (higher alcohol level => lower density) and a similar relationship between volatile acidity and citric acid (higher volatile acidity => lower citric acid).
Interestingly, only two variables stand out as correlated with quality, and both only moderately: alcohol and volatile acidity. However, other variables correlate with these two (for example, density with alcohol, and citric acid with volatile acidity) and so may influence quality indirectly.
The majority of the red wines in our data set have a quality score of 5 or 6.
According to this boxplot, volatile acidity decreases as quality goes up, indicating a negative correlation.
This plot shows a potential correlation between alcohol and quality: the higher the quality, the further to the right the dots tend to sit, indicating higher alcohol.
Through this exploratory data analysis, I identified two key variables that are associated with wine quality: alcohol level and volatile acidity. Several other variables show long-tailed distributions. In my opinion, wine quality is a subjective measure: there is no standard calculation of wine quality, just individual ratings. That may be why the correlations observed here are only moderate. A further inferential study could investigate these relationships in more depth.